Automating Mechanistic Interpretability via Program Synthesis

post by Edy Nastase (edy-nastase) · 2025-04-17T10:58:46.748Z · LW · GW · 1 comments

I have been researching this for a while, and it seems to me that there has been little progress on "automating" mechanistic interpretability (MI) using program synthesis. The only source I could find is a paper from Max Tegmark's lab. However, that paper has been out for quite a while, and not much progress seems to have been made since then (or maybe I am unaware of it).

The inspiration for using program synthesis comes from Stephen Casper [AF · GW], who suggested that program synthesis could be a route to automating MI. In the post itself, he mentions some papers that transform RL policies into programs, while the rest are fairly obscure methods. However, none of them actually try to reverse-engineer parts of neural networks into "programs"...
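
To make "reversing part of a network into a program" concrete, here is a minimal toy sketch. It is not taken from any of the papers mentioned. Everything in it is an assumption made for illustration: the hand-weighted `tiny_network`, the four-operation mini-DSL, and the brute-force enumerative search. The weights are set by hand so that the intended program is known in advance and the search result can be checked against it.

```python
from itertools import product


def relu(x):
    return x if x > 0 else 0.0


def tiny_network(a, b):
    """Hand-weighted one-hidden-layer ReLU net (hypothetical example).

    By construction: relu(a - b) + relu(b) - relu(-b) = relu(a - b) + b = max(a, b).
    """
    h1 = relu(a - b)
    h2 = relu(b)
    h3 = relu(-b)
    return h1 + h2 - h3


# Hypothetical mini-DSL: two input leaves and four binary operations.
LEAVES = {"a": lambda a, b: a, "b": lambda a, b: b}
OPS = {
    "add": lambda x, y: x + y,
    "sub": lambda x, y: x - y,
    "max": max,
    "min": min,
}


def enumerate_programs(depth):
    """Yield (source, function) pairs for all DSL expressions up to `depth`."""
    if depth == 0:
        yield from LEAVES.items()
        return
    yield from enumerate_programs(depth - 1)
    subs = list(enumerate_programs(depth - 1))
    for op_name, op in OPS.items():
        for (ls, lf), (rs, rf) in product(subs, repeat=2):
            # Default arguments bind the current op and sub-programs to this lambda.
            fn = lambda a, b, op=op, lf=lf, rf=rf: op(lf(a, b), rf(a, b))
            yield f"{op_name}({ls}, {rs})", fn


def synthesize(target, samples, max_depth=1):
    """Return the first enumerated program that matches `target` on all samples."""
    for src, fn in enumerate_programs(max_depth):
        if all(abs(fn(a, b) - target(a, b)) < 1e-9 for a, b in samples):
            return src
    return None


SAMPLES = [(a, b) for a in (-2.0, -0.5, 0.0, 1.0, 3.0) for b in (-1.0, 0.0, 0.5, 2.0)]
program = synthesize(tiny_network, SAMPLES)
print(program)  # max(a, b)
```

This only treats the network as a black box and matches input/output behaviour on a handful of samples, so it says nothing about the internal mechanism. A real attempt would need a far richer DSL, a smarter search, and some way to tie program fragments to specific components of the network.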

Now, I was thinking about doing some research on this myself to see if I could end up with a simple prototype. Something like this could represent some of the initial steps:

What do you guys think? Why isn't there much research in this direction? What is missing from the plan proposed above?

1 comment

comment by Sergii (sergey-kharagorgiev) · 2025-04-17T12:28:36.638Z · LW(p) · GW(p)

Apparently it's more efficient to do it the other way around: compile programs into transformers, which are then useful as a reference and ground truth when analyzing "real" transformers.

See the use of TRACR in "Towards Automated Circuit Discovery for Mechanistic Interpretability" (https://arxiv.org/pdf/2304.14997), for example.